This notebook was created for the course "Deep Learning with Actuarial Applications in R" of the Swiss Association of Actuaries (https://www.actuaries.ch/).
This notebook serves as accompanion to the tutorial “Nesting Classical Actuarial Models into Neural Networks”, available on SSRN.
The code is similar to the code used in above tutorial and combines the raw R code in the scripts, available on GitHub along with some more comments. Please refer to the tutorial for explanations.
Note that the results might vary depending on the R and Python package versions, see last section for the result of sessionInfo() and corresponding info on the Python setup.
The tutorial uses the French MTPL data set available on openML (ID 41214).
library(mgcv)
library(keras)
library(magrittr)
library(dplyr)
library(tibble)
library(purrr)
library(ggplot2)
library(gridExtra)
library(splitTools)
library(tidyr)
options(encoding = 'UTF-8')
# set seed to obtain best reproducibility. note that the underlying architecture may affect results nonetheless, so full reproducibility cannot be guaranteed across different platforms.
seed <- 100
Sys.setenv(PYTHONHASHSEED = seed)
set.seed(seed)
reticulate::py_set_seed(seed)
tensorflow::tf$random$set_seed(seed)
The results below will not exactly match the results in the paper, since the underlying dataset and some packages are different. In addition the split into training and testing data is different as well. However, the general conclusions remain the same.
Subsequently, for ease of reading, we provide all the helper functions which are used in this tutorial in this section.
summarize <- function(...) suppressMessages(dplyr::summarize(...))
# Poisson deviance
PoissonDeviance <- function(pred, obs) {
200 * (sum(pred) - sum(obs) + sum(log((obs/pred)^(obs)))) / length(pred)
}
plot_freq <- function(out, xvar, title, model) {
ggplot(out, aes(x = !!sym(xvar), group = 1)) + geom_point(aes(y = pred, colour = model)) + geom_point(aes(y = obs, colour = "observed")) +
geom_line(aes(y = pred, colour = model), linetype = "dashed") + geom_line(aes(y = obs, colour = "observed"), linetype = "dashed") +
ylim(0, 0.35) + labs(x = xvar, y = "frequency", title = title) + theme(legend.position = "bottom")
}
plot_loss <- function(x) {
if (length(x)>1) {
df_val <- data.frame(epoch=1:length(x$loss),train_loss=x$loss,val_loss=x$val_loss)
df_val <- gather(df_val, variable, loss, -epoch)
p <- ggplot(df_val,aes(x=epoch,y=loss)) + geom_line() + facet_wrap(~variable,scales="free") + geom_smooth()
suppressMessages(print(p))
} else exit
}
We consider the data freMTPL2freq included in the R package CASdatasets for claim frequency modeling. This data comprises a French motor third-party liability (MTPL) insurance portfolio with corresponding claim counts observed in one accounting year. We do not incorporate claim sizes which would also be available through freMTPL2sev.
As the current package version provides a slightly amended dataset, we use an older dataset available on openML (ID 41214). Before we can use this data set we need to do some data cleaning. It has been pointed out by F. Loser that some claim counts do not seem to be correct. Hence, we use the pre-processing of the data described in the book "Statistical Foundations of Actuarial Learning and its Applications" in Appendix A.1. This pre-processed data can be downloaded from the course GitHub page here.
load("freMTPL2freq.RData")
A priori, there is not sufficient information about this data to do a sensible decision about the best consideration of the exposure measure, either as feature or as offset. In the following we treat the exposure always as an offset.
Data preprocessing includes a couple of transformations. We ensure that \(\texttt{ClaimNb}\) is an integer, \(\texttt{VehAge}\), \(\texttt{DrivAge}\) and \(\texttt{BonusMalus}\) have been capped for the plots at age 20, age 90 and bonus-malus level 150, respectively, to improve visualization. \(\texttt{Density}\) is logarithmized and \(\texttt{VehGas}\) is a categorical variable. We leave away the rounding used in the first notebook, which were mainly used for nicer visualizations of the data.
We are adding a group_id identifying rows possibly referring to the same policy. Respecting group_id in data splitting techniques (train/test, cross-validation) is essential. This is different to the tutorial where another splitting has been used. As a consequence, the figures in this notebook do not match the figures in the tutorial, but the conclusions drawn are the same.
In addition to the previous tutorial, we decide to truncate the \(\texttt{ClaimNb}\) and the $ in order to correct for unreasonable data entries and simplifications for the modeling part.
# Grouping id
distinct <- freMTPL2freq %>%
distinct_at(vars(-c(IDpol, Exposure, ClaimNb))) %>%
mutate(group_id = row_number())
dat <- freMTPL2freq %>%
left_join(distinct) %>%
mutate(ClaimNb = pmin(as.integer(ClaimNb), 4),
VehAge = pmin(VehAge,20),
DrivAge = pmin(DrivAge,90),
BonusMalus = pmin(BonusMalus,150),
Density = round(log(Density),2),
VehGas = factor(VehGas),
Exposure = pmin(Exposure, 1))
# Group sizes of suspected clusters
table(table(dat[, "group_id"]))
##
## 1 2 3 4 5 6 7 8 9 10 11
## 429576 84201 13940 2437 966 754 720 475 400 269 142
## 12 13 14 15 18 22
## 191 3 1 2 1 1
As previously mentioned, typically features \(x_i\) need pre-processing before being used for a specific model. In our Poisson GLM the regression function is modeled by a log-linear shape in the continuous feature components. From the marginal empirical frequency plots in the previous file we see that such a log-linear form is not always appropriate. We make the following choices here:
Thus, we consider 3 continuous feature components (\(\texttt{Area}\), \(\texttt{BonusMalus}\), \(\texttt{log-Density}\)), 1 binary feature component (\(\texttt{VehGas}\)) and 5 categorical feature components (\(\texttt{VehPower}\), \(\texttt{VehAge}\), \(\texttt{DrivAge}\), \(\texttt{VehBrand}\), \(\texttt{Region}\)). The categorical classes for \(\texttt{VehPower}\), \(\texttt{VehAge}\) and \(\texttt{DrivAge}\) have been done based on expert opinion, only. This expert opinion has tried to find homogeneity within class labels (levels) and every class label should receive a sufficient volume (of observations). We could also make a data-driven choice by using a (marginal) regression tree for different feature components, see references in the tutorial.
dat2 <- dat %>% mutate(
AreaGLM = as.integer(Area),
VehPowerGLM = as.factor(pmin(VehPower,9)),
VehAgeGLM = cut(VehAge, breaks = c(-Inf, 0, 10, Inf), labels = c("1","2","3")),
DrivAgeGLM = cut(DrivAge, breaks = c(-Inf, 20, 25, 30, 40, 50, 70, Inf), labels = c("1","2","3","4","5","6","7")),
BonusMalusGLM = as.integer(pmin(BonusMalus, 150)),
DensityGLM = as.numeric(Density),
VehAgeGLM = relevel(VehAgeGLM, ref="2"),
DrivAgeGLM = relevel(DrivAgeGLM, ref="5"),
Region = relevel(Region, ref="R24")
)
We remark that for categorical variables we use the data type factor in R. This data type automatically considers dummy coding in the corresponding R procedures. Categorical variables are initialized to one class (reference level). We typically initialize to the class with the biggest volume. This initialization is achieved by the command relevel, see above. This initialization does not influence the fitted means but provides a unique parametrization. See ?relevel for further details.
head(dat2)
str(dat2)
## 'data.frame': 678007 obs. of 20 variables:
## $ IDpol : num 1 3 5 10 11 13 15 17 18 21 ...
## $ Exposure : num 0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
## $ Area : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
## $ VehPower : num 5 5 6 7 7 6 6 7 7 7 ...
## $ VehAge : num 0 0 2 0 0 2 2 0 0 0 ...
## $ DrivAge : num 55 55 52 46 46 38 38 33 33 41 ...
## $ BonusMalus : num 50 50 50 50 50 50 50 68 68 50 ...
## $ VehBrand : Factor w/ 11 levels "B1","B2","B3",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ VehGas : Factor w/ 2 levels "Diesel","Regular": 2 2 1 1 1 2 2 1 1 1 ...
## $ Density : num 7.1 7.1 3.99 4.33 4.33 8.01 8.01 4.92 4.92 4.09 ...
## $ Region : Factor w/ 22 levels "R24","R11","R21",..: 18 18 4 15 15 8 8 20 20 12 ...
## $ ClaimTotal : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ClaimNb : num 0 0 0 0 0 0 0 0 0 0 ...
## $ group_id : int 1 1 2 3 3 4 4 5 5 6 ...
## $ AreaGLM : int 4 4 2 2 2 5 5 3 3 2 ...
## $ VehPowerGLM : Factor w/ 6 levels "4","5","6","7",..: 2 2 3 4 4 3 3 4 4 4 ...
## $ VehAgeGLM : Factor w/ 3 levels "2","1","3": 2 2 1 2 2 1 1 2 2 2 ...
## $ DrivAgeGLM : Factor w/ 7 levels "5","1","2","3",..: 6 6 6 1 1 5 5 5 5 1 ...
## $ BonusMalusGLM: int 50 50 50 50 50 50 50 68 68 50 ...
## $ DensityGLM : num 7.1 7.1 3.99 4.33 4.33 8.01 8.01 4.92 4.92 4.09 ...
summary(dat2)
## IDpol Exposure Area VehPower
## Min. : 1 Min. :0.002732 A:103957 Min. : 4.000
## 1st Qu.:1157948 1st Qu.:0.180000 B: 75459 1st Qu.: 5.000
## Median :2272153 Median :0.490000 C:191880 Median : 6.000
## Mean :2621857 Mean :0.528547 D:151590 Mean : 6.455
## 3rd Qu.:4046278 3rd Qu.:0.990000 E:137167 3rd Qu.: 7.000
## Max. :6114330 Max. :1.000000 F: 17954 Max. :15.000
##
## VehAge DrivAge BonusMalus VehBrand
## Min. : 0.000 Min. :18.0 Min. : 50.00 B12 :166024
## 1st Qu.: 2.000 1st Qu.:34.0 1st Qu.: 50.00 B1 :162730
## Median : 6.000 Median :44.0 Median : 50.00 B2 :159861
## Mean : 6.976 Mean :45.5 Mean : 59.76 B3 : 53395
## 3rd Qu.:11.000 3rd Qu.:55.0 3rd Qu.: 64.00 B5 : 34753
## Max. :20.000 Max. :90.0 Max. :150.00 B6 : 28548
## (Other): 72696
## VehGas Density Region ClaimTotal
## Diesel :332136 Min. : 0.000 R24 :160601 Min. : 0
## Regular:345871 1st Qu.: 4.520 R82 : 84752 1st Qu.: 0
## Median : 5.970 R93 : 79315 Median : 0
## Mean : 5.982 R11 : 69791 Mean : 88
## 3rd Qu.: 7.410 R53 : 42122 3rd Qu.: 0
## Max. :10.200 R52 : 38751 Max. :4075401
## (Other):202675
## ClaimNb group_id AreaGLM VehPowerGLM VehAgeGLM
## Min. :0.00000 Min. : 1 Min. :1.00 4:115343 2:434492
## 1st Qu.:0.00000 1st Qu.:149318 1st Qu.:2.00 5:124821 1: 57739
## Median :0.00000 Median :273211 Median :3.00 6:148976 3:185776
## Mean :0.03891 Mean :275320 Mean :3.29 7:145401
## 3rd Qu.:0.00000 3rd Qu.:404072 3rd Qu.:4.00 8: 46956
## Max. :4.00000 Max. :534079 Max. :6.00 9: 96510
##
## DrivAgeGLM BonusMalusGLM DensityGLM
## 5:165185 Min. : 50.00 Min. : 0.000
## 1: 6816 1st Qu.: 50.00 1st Qu.: 4.520
## 2: 32079 Median : 50.00 Median : 5.970
## 3: 65594 Mean : 59.76 Mean : 5.982
## 4:170097 3rd Qu.: 64.00 3rd Qu.: 7.410
## 6:198871 Max. :150.00 Max. :10.200
## 7: 39365
First, we split the dataset into train and test. Due to the potential grouping of rows in policies we can not just do a random split. For this purpose, we use the function partition(...) from the splitTools package.
ind <- partition(dat2[["group_id"]], p = c(train = 0.8, test = 0.2),
seed = seed, type = "grouped")
train <- dat2[ind$train, ]
test <- dat2[ind$test, ]
It describes our choices of the learning data set \(\mathcal{D}\) and the test data set \(\mathcal{T}\) That is, we allocate at random 80% of the policies to \(\mathcal{D}\) and the remaining 20% of the policies to \(\mathcal{T}\).
Usually, an 90/10 or 80/20 is used for training and test data. This is a rule-of-thumb and best practice in modeling. A good explanation can be found here, citing as follows: "There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance. Broadly speaking you should be concerned with dividing data such that neither variance is too high, which is more to do with the absolute number of instances in each category rather than the percentage."
# size of train/test
sprintf("Number of observations (train): %s", nrow(train))
## [1] "Number of observations (train): 542331"
sprintf("Number of observations (test): %s", nrow(test))
## [1] "Number of observations (test): 135676"
# Claims frequency of train/test
sprintf("Empirical frequency (train): %s", round(sum(train$ClaimNb) / sum(train$Exposure), 4))
## [1] "Empirical frequency (train): 0.0736"
sprintf("Empirical frequency (test): %s", round(sum(test$ClaimNb) / sum(test$Exposure), 4))
## [1] "Empirical frequency (test): 0.0736"
As we are going to compare various models, we create a table which stores the metrics we are going to use for the comparison and the selection of the best model.
# initialize table to store all model results for comparison
df_cmp <- tibble(
model = character(),
epochs = numeric(),
run_time = numeric(),
parameters = numeric(),
in_sample_loss = numeric(),
out_sample_loss = numeric(),
avg_freq = numeric(),
)
In the following chapters, we are going to fit various feed-forward neural networks for the data. At the end, we will compare the performance and quality of the several fitted models. We compare them by using the metrics defined above.
Before fitting any neural network, we fit a generalized linear model (GLM) which serves as baseline model.
In the following, we will fit various claim frequency models based on a Poisson assumption, to be more precise we make the following assumptions:
The main problem to be solved is to find the regression function \(\lambda(\cdot)\) such that it appropriately describes the data, and such that it generalizes to similar data which has not been seen, yet. Remark that the task of finding an appropriate regression function \(\lambda : \mathcal{X} \rightarrow \mathbb{R}_+\) also includes the defnition of the feature space \(\mathcal{X}\) which typically varies over different modeling approaches.
The defined feature components are continuous in nature, but we have been turning them into categorical ones for modeling purposes (as mentioned above). Having so much data, we can further explore these categorical feature components by trying to replace them by ordinal ones assuming an appropriate continuous functional form, still fitting into the GLM framework.
As an example we show how to bring DrivAge into a continuous functional form. We therefore modify the feature space \(\mathcal{X}\) and the regression function \(\lambda(\cdot)\) from (1.3). We replace the 7 categorical age classes by the following continuous function:
Thus, we replace the 7 categorical classes (involving 6 regression parameters from dummy coding) by the above continuous functional form having 5 regression parameters. The remaining parts of the regression function in (1.3) are kept unchanged
This model will be our baseline model with which we compare the subsequent fitted feed-forward neural networks.
exec_time <- system.time(
glm2 <- glm(ClaimNb ~ AreaGLM + VehPowerGLM + VehAgeGLM + BonusMalusGLM + VehBrand + VehGas + DensityGLM + Region +
DrivAge + log(DrivAge) + I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4),
data = train, offset = log(Exposure), family = poisson())
)
exec_time[1:5]
## user.self sys.self elapsed user.child sys.child
## 17.807 1.509 19.322 0.000 0.000
summary(glm2)
##
## Call:
## glm(formula = ClaimNb ~ AreaGLM + VehPowerGLM + VehAgeGLM + BonusMalusGLM +
## VehBrand + VehGas + DensityGLM + Region + DrivAge + log(DrivAge) +
## I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4), family = poisson(),
## data = train, offset = log(Exposure))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4705 -0.3250 -0.2461 -0.1382 6.8980
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.653e+01 6.908e+00 11.079 < 2e-16 ***
## AreaGLM 4.681e-02 2.128e-02 2.199 0.027873 *
## VehPowerGLM5 5.782e-02 2.435e-02 2.374 0.017575 *
## VehPowerGLM6 9.069e-02 2.389e-02 3.796 0.000147 ***
## VehPowerGLM7 6.610e-02 2.377e-02 2.780 0.005428 **
## VehPowerGLM8 9.579e-02 3.381e-02 2.833 0.004605 **
## VehPowerGLM9 2.376e-01 2.655e-02 8.949 < 2e-16 ***
## VehAgeGLM1 -1.945e-02 3.419e-02 -0.569 0.569403
## VehAgeGLM3 -1.840e-01 1.633e-02 -11.272 < 2e-16 ***
## BonusMalusGLM 2.755e-02 4.127e-04 66.755 < 2e-16 ***
## VehBrandB2 -8.200e-03 1.933e-02 -0.424 0.671477
## VehBrandB3 5.923e-02 2.670e-02 2.219 0.026515 *
## VehBrandB4 5.602e-02 3.623e-02 1.546 0.122043
## VehBrandB5 8.473e-02 3.086e-02 2.746 0.006034 **
## VehBrandB6 1.326e-02 3.499e-02 0.379 0.704762
## VehBrandB10 8.286e-03 4.431e-02 0.187 0.851663
## VehBrandB11 1.860e-01 4.688e-02 3.967 7.27e-05 ***
## VehBrandB12 -2.505e-01 2.442e-02 -10.257 < 2e-16 ***
## VehBrandB13 5.417e-02 4.992e-02 1.085 0.277899
## VehBrandB14 -1.602e-01 9.794e-02 -1.636 0.101797
## VehGasRegular -1.569e-01 1.494e-02 -10.500 < 2e-16 ***
## DensityGLM 3.919e-02 1.581e-02 2.478 0.013211 *
## RegionR11 -9.825e-03 3.111e-02 -0.316 0.752112
## RegionR21 2.124e-03 1.314e-01 0.016 0.987099
## RegionR22 1.766e-01 6.414e-02 2.753 0.005906 **
## RegionR23 -4.110e-02 7.866e-02 -0.522 0.601323
## RegionR25 -3.692e-02 5.557e-02 -0.664 0.506376
## RegionR26 4.565e-02 6.128e-02 0.745 0.456283
## RegionR31 2.350e-02 4.055e-02 0.579 0.562303
## RegionR41 -1.508e-01 5.514e-02 -2.735 0.006241 **
## RegionR42 2.856e-02 1.168e-01 0.245 0.806763
## RegionR43 -1.427e-01 1.896e-01 -0.753 0.451682
## RegionR52 2.480e-02 3.201e-02 0.775 0.438538
## RegionR53 2.326e-02 2.952e-02 0.788 0.430749
## RegionR54 3.652e-02 4.265e-02 0.856 0.391850
## RegionR72 1.099e-01 3.728e-02 2.949 0.003185 **
## RegionR73 -1.697e-01 5.944e-02 -2.855 0.004304 **
## RegionR74 4.124e-01 7.951e-02 5.187 2.14e-07 ***
## RegionR82 2.234e-01 2.367e-02 9.441 < 2e-16 ***
## RegionR83 1.885e-02 9.388e-02 0.201 0.840865
## RegionR91 -2.618e-03 3.834e-02 -0.068 0.945544
## RegionR93 1.446e-01 2.669e-02 5.418 6.04e-08 ***
## RegionR94 1.518e-01 9.800e-02 1.549 0.121306
## DrivAge 3.874e+00 4.007e-01 9.670 < 2e-16 ***
## log(DrivAge) -4.661e+01 4.231e+00 -11.018 < 2e-16 ***
## I(DrivAge^2) -5.549e-02 6.728e-03 -8.248 < 2e-16 ***
## I(DrivAge^3) 4.399e-04 6.360e-05 6.917 4.61e-12 ***
## I(DrivAge^4) -1.388e-06 2.421e-07 -5.733 9.85e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 136692 on 542330 degrees of freedom
## Residual deviance: 130634 on 542283 degrees of freedom
## AIC: 171306
##
## Number of Fisher Scoring iterations: 6
Exercise: One could possible reduce the number of parameters by reducing the variables included. Do so and compare the result to the currently used model.
Exercise: This glm might be improved from a modeling perspective by excluding not significant variables.
The summary() function for a glm object provides the statistical tests of signifiance for every single parameter. However, with cateogorical variables the primary interest is to know if a categorical variables at all is significant. This can be done using the R fucntion drop1, see its help file for further details. It performs a Likelihood Ratio Test (LRT) which confirms that only the p-value for AreaGLM is above 5%.
drop1(glm2, test = "LRT")
Exercise: Consider a similar approach as DrivAge for another feature (e.g. BonusMalus)
# Predictions
train$fitGLM2 <- fitted(glm2)
test$fitGLM2 <- predict(glm2, newdata = test, type = "response")
dat$fitGLM2 <- predict(glm2, newdata = dat2, type = "response")
# in-sample and out-of-sample losses (in 10^(-2))
sprintf("100 x Poisson deviance GLM (train): %s", PoissonDeviance(train$fitGLM2, train$ClaimNb))
## [1] "100 x Poisson deviance GLM (train): 24.0874788981516"
sprintf("100 x Poisson deviance GLM (test): %s", PoissonDeviance(test$fitGLM2, test$ClaimNb))
## [1] "100 x Poisson deviance GLM (test): 24.1666108887935"
# Overall estimated frequency
sprintf("average frequency (test): %s", round(sum(test$fitGLM2) / sum(test$Exposure), 4))
## [1] "average frequency (test): 0.0737"
df_cmp %<>% bind_rows(
data.frame(model = "M1: GLM", epochs = NA, run_time = round(exec_time[[3]], 0), parameters = length(coef(glm2)),
in_sample_loss = round(PoissonDeviance(train$fitGLM2, as.vector(unlist(train$ClaimNb))), 4),
out_sample_loss = round(PoissonDeviance(test$fitGLM2, as.vector(unlist(test$ClaimNb))), 4),
avg_freq = round(sum(test$fitGLM2) / sum(test$Exposure), 4))
)
df_cmp
In addition to fitting and validating the model with a few metrics, it is important to check if the model is well calibrated across the feature space. E.g. it could be that the overall fit of a model is good, but that there are areas where the model under- and overestimates the claim frequencies. It is the goal of the subsequent calibration plots to ensure the proper fit along the whole feature space.
# Area
out <- test %>% group_by(Area) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitGLM2) / sum(Exposure))
p1 <- plot_freq(out, "Area", "frequency by area", "GLM")
# VehPower
out <- test %>% group_by(VehPower) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitGLM2) / sum(Exposure))
p2 <- plot_freq(out, "VehPower", "frequency by vehicle power", "GLM")
# VehBrand
out <- test %>% group_by(VehBrand) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitGLM2) / sum(Exposure))
p3 <- plot_freq(out, "VehBrand", "frequency by vehicle brand", "GLM")
# VehAge
out <- test %>% group_by(VehAge) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitGLM2) / sum(Exposure))
p4 <- plot_freq(out, "VehAge", "frequency by vehicle age", "GLM")
grid.arrange(p1, p2, p3, p4)
Based on the charts, no issues are detected and the model seems to be well calibrated.
Exercise: Perform the calibration with other variables not yet in the charts above.
In this chapter, we explain how the data need to be pre-processed to be used in neural networks. It can not be processed in the same way as shown above for GLMs. Further details can be found in this tutorial on SSRN, chapter 2.
We are going to highlight a few important points in data pre-processing that are necessary for a successful application of networks.
In network modeling the choice of the scale of the feature components may substantially influence the fitting procedure of the predictive model. Therefore, data pre-processing requires careful consideration. We treat unordered categorical (nominal) feature components and continuous (or ordinal) feature components separately. Ordered categorical feature components are treated like continuous ones, where we simply replace the ordered categorical labels by integers. Binary categorical feature components are coded by 0's and 1's for the two binary labels (for binary labels we do not distinguish between ordered and unordered components). Remark that if we choose an anti-symmetric activation function, i.e. \(-\phi(x) = \phi(-x)\), we may also set binary categorical feature components to \(\pm 1/2\), which may simplify initialization of optimization algorithms.
We need to transform (nominal) categorical feature components to numerical values. The most commonly used transformations are the so-called dummy coding and the one-hot encoding. Both methods construct binary representations for categorical labels. For dummy coding one label is chosen as reference level. Dummy coding then uses binary variables to indicate which label a particular policy possesses if it differs from the reference level. In our example we have two unordered categorical feature components, namely VehBrand and Region. We use VehBrand as illustration. It has 11 different labels \(\{B_1, B_{10}, B_{11}, B_{12}, B_{13}, B_{14}, B_2, B_3, B_4, B_5, B_6\}\). We choose \(B_1\) as reference label. Dummy coding then provides the coding scheme below (left). We observe that the 11 labels are replaced by 10-dimensional feature vectors \(\{0,1\}^{10}\), with components summing up to either 0 or 1.
In contrast to dummy coding, one-hot encoding does not choose a reference level, but uses an indicator for each label. In this way the 11 labels of VehBrand are replaced by the 11 unit vectors. The main difference between dummy coding and one-hot encoding is that the former leads to full rank design matrices, whereas the latter does not. This implies that under one-hot encoding there are identifiability issues in parametrizations. In network modeling identifiability is less important because we typically work in over-parametrized nonconvex optimization problems (with multiple equally good models/parametrizations); on the other hand, identifiability in GLMs is an important feature because one typically tries to solve a convex optimization problem, where the full rank property is important to efficiently and the (unique) solution.
Remark that other coding schemes could be used for categorical feature components such as Helmert's contrast coding. In classical GLMs the choice of the coding scheme typically does not influence the prediction, however, interpretation of the results may change by considering a different contrast. In network modeling the choice of the coding scheme may influence the prediction: typically, we exercise an early stopping rule in network calibrations. This early stopping rule and the corresponding result may depend on any chosen modeling strategy, such as the encoding scheme of categorical feature components.
Remark that dummy coding and one-hot encoding may lead to very high-dimensional input layers in networks, and it provides sparsity in input features. Moreover, the Euclidean distance between any two labels in the one-hot encoding scheme is that same. From natural language processing (NLP) we have learned that there are more efficient ways of representing categorical feature components, namely, by embedding them into lower-dimensional spaces so that proximity in these spaces has a useful meaning in the regression task. In networks this can be achieved by so-called embedding layers. In the context of our French MTPL example we refer to our next notebook.
In theory, continuous feature components do not need pre-processing if we choose a sufficiently rich network, because the network may take care of feature components living on different scales. This statement is of purely theoretical value. In practice, continuous feature components need pre-processing such that they all live on a similar scale and such that they are sufficiently equally distributed across this scale. The reason for this requirement is that the calibration algorithms mostly use gradient descent methods (GDMs). These GDMs only work properly, if all components live on a similar scale and, thus, all directions contribute equally to the gradient. Otherwise, the optimization algorithms may get trapped in saddle points or in regions where the gradients are at (also known as vanishing gradient problem). Often, one uses \([-1,+1]\) as the common scale because the/our choice of activation function is focused to that scale.
A popular transformation is the so-called MinMaxScaler. For this transformation we fix each continuous feature component of \(x\), say \(x_l\), at a time. Denote the minimum and the maximum of the domain of \(x_l\) by \(m_l\) and \(M_l\), respectively. The MinMaxScaler then replaces
In practice, it may happen that the minimum ml or the maximum Ml is not known. In this case one chooses the corresponding minimum and/or maximum of the features in the observed data. For prediction under new features one then needs to keep the original scaling of the initially observed data, i.e. the one which has been used for model calibration.
Remark that if we have outliers, the above transformations may lead to very concentrated transformed feature components \(x_l^*\), \(i=1,...,n\), because the outliers may, for instance, dominate the maximum in the MinMaxScaler. In this case, feature components should be transformed first by a log-transformation or by a quantile transformation so that they become more equally spaced (and robust) across the real line.
We observe that binary feature components (e.g. gender) are often embedded in the ML literature into a higher-dimensional space. However, we are of the opinion that this does not make sense. Hence, we suggest to set binary categorical feature components to \(\pm 1/2\).
As a rule of thumb one could formulate it as follows:
In our example we use dummy coding for the feature components VehBrand and Region. We use the MinMaxScaler for Area (after transforming \(\{A,...,F\}\) \(\mapsto\) \(\{1,...,6\}\)), VehPower, VehAge (after capping at age 20), DrivAge (after capping at age 90), BonusMalus (after capping at level 150) and Density (after first taking the log-transform). VehGas we transform to \(\pm 1/2\) and the volume Exposure \(\in (0,1]\) we keep untransformed.
Below the corresponding pre-processing functions:
# MinMax scaler
preprocess_minmax <- function(varData) {
X <- as.numeric(varData)
2 * (X - min(X)) / (max(X) - min(X)) - 1
}
# Dummy coding
preprocess_catdummy <- function(data, varName, prefix) {
varData <- data[[varName]]
X <- as.integer(varData)
n0 <- length(unique(X))
n1 <- 2:n0
addCols <- purrr::map(n1, function(x, y) {as.integer(y == x)}, y = X) %>%
rlang::set_names(paste0(prefix, n1))
cbind(data, addCols)
}
# Feature pre-processing using MinMax Scaler and Dummy Coding
preprocess_features <- function(data) {
data %>%
mutate_at(
c(AreaX = "Area", VehPowerX = "VehPower", VehAgeX = "VehAge",
DrivAgeX = "DrivAge", BonusMalusX = "BonusMalus", DensityX = "Density"),
preprocess_minmax
) %>%
mutate(
VehGasX = as.integer(VehGas) - 1.5,
VehBrandX = as.integer(VehBrand) - 1,
RegionX = as.integer(Region) - 1
) %>%
preprocess_catdummy("VehBrand", "Br") %>%
preprocess_catdummy("Region", "R")
}
dat2 <- preprocess_features(dat)
head(dat2)
str(dat2)
## 'data.frame': 678007 obs. of 55 variables:
## $ IDpol : num 1 3 5 10 11 13 15 17 18 21 ...
## $ Exposure : num 0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
## $ Area : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
## $ VehPower : num 5 5 6 7 7 6 6 7 7 7 ...
## $ VehAge : num 0 0 2 0 0 2 2 0 0 0 ...
## $ DrivAge : num 55 55 52 46 46 38 38 33 33 41 ...
## $ BonusMalus : num 50 50 50 50 50 50 50 68 68 50 ...
## $ VehBrand : Factor w/ 11 levels "B1","B2","B3",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ VehGas : Factor w/ 2 levels "Diesel","Regular": 2 2 1 1 1 2 2 1 1 1 ...
## $ Density : num 7.1 7.1 3.99 4.33 4.33 8.01 8.01 4.92 4.92 4.09 ...
## $ Region : Factor w/ 22 levels "R11","R21","R22",..: 18 18 3 15 15 8 8 20 20 12 ...
## $ ClaimTotal : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ClaimNb : num 0 0 0 0 0 0 0 0 0 0 ...
## $ group_id : int 1 1 2 3 3 4 4 5 5 6 ...
## $ fitGLM2 : num 0.00583 0.04486 0.04233 0.00459 0.0428 ...
## $ AreaX : num 0.2 0.2 -0.6 -0.6 -0.6 0.6 0.6 -0.2 -0.2 -0.6 ...
## $ VehPowerX : num -0.818 -0.818 -0.636 -0.455 -0.455 ...
## $ VehAgeX : num -1 -1 -0.8 -1 -1 -0.8 -0.8 -1 -1 -1 ...
## $ DrivAgeX : num 0.0278 0.0278 -0.0556 -0.2222 -0.2222 ...
## $ BonusMalusX: num -1 -1 -1 -1 -1 -1 -1 -0.64 -0.64 -1 ...
## $ DensityX : num 0.392 0.392 -0.218 -0.151 -0.151 ...
## $ VehGasX : num 0.5 0.5 -0.5 -0.5 -0.5 0.5 0.5 -0.5 -0.5 -0.5 ...
## $ VehBrandX : num 8 8 8 8 8 8 8 8 8 8 ...
## $ RegionX : num 17 17 2 14 14 7 7 19 19 11 ...
## $ Br2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Br3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Br4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Br5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Br6 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Br7 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Br8 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Br9 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Br10 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Br11 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R3 : int 0 0 1 0 0 0 0 0 0 0 ...
## $ R4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R6 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R7 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R8 : int 0 0 0 0 0 1 1 0 0 0 ...
## $ R9 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R10 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R11 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R12 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ R13 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R14 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R15 : int 0 0 0 1 1 0 0 0 0 0 ...
## $ R16 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R17 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R18 : int 1 1 0 0 0 0 0 0 0 0 ...
## $ R19 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R20 : int 0 0 0 0 0 0 0 1 1 0 ...
## $ R21 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ R22 : int 0 0 0 0 0 0 0 0 0 0 ...
summary(dat2)
## IDpol Exposure Area VehPower
## Min. : 1 Min. :0.002732 A:103957 Min. : 4.000
## 1st Qu.:1157948 1st Qu.:0.180000 B: 75459 1st Qu.: 5.000
## Median :2272153 Median :0.490000 C:191880 Median : 6.000
## Mean :2621857 Mean :0.528547 D:151590 Mean : 6.455
## 3rd Qu.:4046278 3rd Qu.:0.990000 E:137167 3rd Qu.: 7.000
## Max. :6114330 Max. :1.000000 F: 17954 Max. :15.000
##
## VehAge DrivAge BonusMalus VehBrand
## Min. : 0.000 Min. :18.0 Min. : 50.00 B12 :166024
## 1st Qu.: 2.000 1st Qu.:34.0 1st Qu.: 50.00 B1 :162730
## Median : 6.000 Median :44.0 Median : 50.00 B2 :159861
## Mean : 6.976 Mean :45.5 Mean : 59.76 B3 : 53395
## 3rd Qu.:11.000 3rd Qu.:55.0 3rd Qu.: 64.00 B5 : 34753
## Max. :20.000 Max. :90.0 Max. :150.00 B6 : 28548
## (Other): 72696
## VehGas Density Region ClaimTotal
## Diesel :332136 Min. : 0.000 R24 :160601 Min. : 0
## Regular:345871 1st Qu.: 4.520 R82 : 84752 1st Qu.: 0
## Median : 5.970 R93 : 79315 Median : 0
## Mean : 5.982 R11 : 69791 Mean : 88
## 3rd Qu.: 7.410 R53 : 42122 3rd Qu.: 0
## Max. :10.200 R52 : 38751 Max. :4075401
## (Other):202675
## ClaimNb group_id fitGLM2 AreaX
## Min. :0.00000 Min. : 1 Min. :0.0000632 Min. :-1.00000
## 1st Qu.:0.00000 1st Qu.:149318 1st Qu.:0.0118361 1st Qu.:-0.60000
## Median :0.00000 Median :273211 Median :0.0329438 Median :-0.20000
## Mean :0.03891 Mean :275320 Mean :0.0389177 Mean :-0.08412
## 3rd Qu.:0.00000 3rd Qu.:404072 3rd Qu.:0.0545373 3rd Qu.: 0.20000
## Max. :4.00000 Max. :534079 Max. :2.0223469 Max. : 1.00000
##
## VehPowerX VehAgeX DrivAgeX BonusMalusX
## Min. :-1.0000 Min. :-1.0000 Min. :-1.00000 Min. :-1.0000
## 1st Qu.:-0.8182 1st Qu.:-0.8000 1st Qu.:-0.55556 1st Qu.:-1.0000
## Median :-0.6364 Median :-0.4000 Median :-0.27778 Median :-1.0000
## Mean :-0.5537 Mean :-0.3024 Mean :-0.23620 Mean :-0.8049
## 3rd Qu.:-0.4545 3rd Qu.: 0.1000 3rd Qu.: 0.02778 3rd Qu.:-0.7200
## Max. : 1.0000 Max. : 1.0000 Max. : 1.00000 Max. : 1.0000
##
## DensityX VehGasX VehBrandX RegionX
## Min. :-1.0000 Min. :-0.50000 Min. : 0.000 Min. : 0.00
## 1st Qu.:-0.1137 1st Qu.:-0.50000 1st Qu.: 1.000 1st Qu.: 4.00
## Median : 0.1706 Median : 0.50000 Median : 2.000 Median :11.00
## Mean : 0.1729 Mean : 0.01013 Mean : 3.398 Mean :10.29
## 3rd Qu.: 0.4529 3rd Qu.: 0.50000 3rd Qu.: 8.000 3rd Qu.:17.00
## Max. : 1.0000 Max. : 0.50000 Max. :10.000 Max. :21.00
##
## Br2 Br3 Br4 Br5
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.2358 Mean :0.07875 Mean :0.03714 Mean :0.05126
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Br6 Br7 Br8 Br9
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.04211 Mean :0.02612 Mean :0.02004 Mean :0.2449
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## Br10 Br11 R2 R3
## Min. :0.00000 Min. :0.000000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.00000 Median :0.000000 Median :0.000000 Median :0.00000
## Mean :0.01796 Mean :0.005969 Mean :0.004463 Mean :0.01179
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.000000 Max. :1.000000 Max. :1.00000
##
## R4 R5 R6 R7
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.01296 Mean :0.2369 Mean :0.01607 Mean :0.01547
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## R8 R9 R10 R11
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.000000 Median :0.000000
## Mean :0.04024 Mean :0.01916 Mean :0.003245 Mean :0.001956
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.000000
##
## R12 R13 R14 R15
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.05715 Mean :0.06213 Mean :0.02809 Mean :0.04621
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## R16 R17 R18 R19
## Min. :0.00000 Min. :0.000000 Min. :0.000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000 Median :0.000 Median :0.000000
## Mean :0.02528 Mean :0.006736 Mean :0.125 Mean :0.007798
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000 Max. :1.000 Max. :1.000000
##
## R20 R21 R22
## Min. :0.0000 Min. :0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.000000
## Median :0.0000 Median :0.000 Median :0.000000
## Mean :0.0528 Mean :0.117 Mean :0.006661
## 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.000 Max. :1.000000
##
First, we split the dataset into train and test. Due to the potential grouping of rows in policies we can not just do a random split. For this purpose, we use the function partition(...) from the splitTools package.
ind <- partition(dat2[["group_id"]], p = c(train = 0.8, test = 0.2),
seed = seed, type = "grouped")
train <- dat2[ind$train, ]
test <- dat2[ind$test, ]
# size of train/test
sprintf("Number of observations (train): %s", nrow(train))
## [1] "Number of observations (train): 542331"
sprintf("Number of observations (test): %s", nrow(test))
## [1] "Number of observations (test): 135676"
# Claims frequency of train/test
sprintf("Empirical frequency (train): %s", round(sum(train$ClaimNb) / sum(train$Exposure), 4))
## [1] "Empirical frequency (train): 0.0736"
sprintf("Empirical frequency (test): %s", round(sum(test$ClaimNb) / sum(test$Exposure), 4))
## [1] "Empirical frequency (test): 0.0736"
In this section, we define objects and parameters which are used for all subsequent neural networks considered, independent of their network structure.
We need to define which components in the pre-processed dataset dat2 are used as input features. As we have added the pre-processed features appropriate for the neural networks to the original features, we only must use the relevant ones.
# select the feature space
features <- c("AreaX","VehPowerX","VehAgeX","DrivAgeX","BonusMalusX","DensityX","VehGasX",
"Br2","Br3","Br4","Br5","Br6","Br7","Br8","Br9","Br10","Br11",
"R2","R3","R4","R5","R6","R7","R8","R9","R10","R11","R12","R13","R14","R15","R16","R17","R18","R19","R20","R21","R22")
print(colnames(train[, features]))
## [1] "AreaX" "VehPowerX" "VehAgeX" "DrivAgeX" "BonusMalusX"
## [6] "DensityX" "VehGasX" "Br2" "Br3" "Br4"
## [11] "Br5" "Br6" "Br7" "Br8" "Br9"
## [16] "Br10" "Br11" "R2" "R3" "R4"
## [21] "R5" "R6" "R7" "R8" "R9"
## [26] "R10" "R11" "R12" "R13" "R14"
## [31] "R15" "R16" "R17" "R18" "R19"
## [36] "R20" "R21" "R22"
The input to keras requires the train and testing data to be of matrix format, including all features used in the matrix and already correctly pre-processed.
# feature matrix
Xtrain <- as.matrix(train[, features]) # design matrix training sample
Xtest <- as.matrix(test[, features]) # design matrix test sample
The choice of a particular network architecture and its calibration involve many steps. In each of these steps the modeler has to make certain decisions, and it may require that each of these decisions is revised several times in order to get the best (or more modestly a good) predictive model.
In this section we do not provide an explanation on the ones explicitly used below. We refer to this tutorial on SSRN, the notebook on FNN and further literature (provided below in the last chapter) to learn more about these hyperparameters and the way to choose the right architecture.
# available optimizers for keras
# https://keras.io/optimizers/
optimizers <- c('sgd', 'adagrad', 'adadelta', 'rmsprop', 'adam', 'adamax', 'nadam')
# homogeneous model (train)
lambda_hom <- sum(train$ClaimNb) / sum(train$Exposure)
Let us fit a deep neural network. In this chapter, we will not provide as many comments as above, as they remain valid for this chapter as well.
Before fitting and specifiying a neural network, we highly recommend to draw it. This helps in using the keras function.
We choose a network with \(K=3\) three hidden layer and \(q_1=20\), \(q_1=15\) and \(q_1=10\) neurons, which looks as follows:
The dimension of the input layer is defined by the selected input feature dimension (see above).
# define network
q0 <- length(features) # dimension of features
q1 <- 20 # number of neurons in first hidden layer
q2 <- 15 # number of neurons in second hidden layer
q3 <- 10 # number of neurons in second hidden layer
sprintf("Neural network with K=3 hidden layer")
## [1] "Neural network with K=3 hidden layer"
sprintf("Input feature dimension: q0 = %s", q0)
## [1] "Input feature dimension: q0 = 38"
sprintf("Number of hidden neurons first layer: q1 = %s", q1)
## [1] "Number of hidden neurons first layer: q1 = 20"
sprintf("Number of hidden neurons second layer: q2 = %s", q2)
## [1] "Number of hidden neurons second layer: q2 = 15"
sprintf("Number of hidden neurons third layer: q3 = %s", q3)
## [1] "Number of hidden neurons third layer: q3 = 10"
sprintf("Output dimension: %s", 1)
## [1] "Output dimension: 1"
Below we define the feed-forward shallow network with \(q_1\) hidden neurons and hyperbolic tangent activation function, the output has exponential activation function due to the Poisson GLM.
The exposure is included as an offset, and we use the homogeneous model for initializing the output.
In this notebook we do not provide further details on the layers used and the structure, we refer to other notebooks or the references at the end of this notebook. A good reference is the keras cheat sheet.
Design <- layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
LogVol <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVol')
Network <- Design %>%
layer_dense(units = q1, activation = 'tanh', name = 'layer1') %>%
layer_dense(units = q2, activation = 'tanh', name = 'layer2') %>%
layer_dense(units = q3, activation = 'tanh', name = 'layer3') %>%
layer_dense(units = 1, activation = 'linear', name = 'Network',
weights = list(array(0, dim = c(q3, 1)), array(log(lambda_hom), dim = c(1))))
Response <- list(Network, LogVol) %>%
layer_add(name = 'Add') %>%
layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
weights = list(array(1, dim = c(1, 1)), array(0, dim = c(1))))
model_dp <- keras_model(inputs = c(Design, LogVol), outputs = c(Response))
model_dp %>% compile(
loss = 'poisson',
optimizer = optimizers[7]
)
summary(model_dp)
## Model: "model"
## ________________________________________________________________________________
## Layer (type) Output Shape Param # Connected to
## ================================================================================
## Design (InputLayer) [(None, 38)] 0
## ________________________________________________________________________________
## layer1 (Dense) (None, 20) 780 Design[0][0]
## ________________________________________________________________________________
## layer2 (Dense) (None, 15) 315 layer1[0][0]
## ________________________________________________________________________________
## layer3 (Dense) (None, 10) 160 layer2[0][0]
## ________________________________________________________________________________
## Network (Dense) (None, 1) 11 layer3[0][0]
## ________________________________________________________________________________
## LogVol (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## Add (Add) (None, 1) 0 Network[0][0]
## LogVol[0][0]
## ________________________________________________________________________________
## Response (Dense) (None, 1) 2 Add[0][0]
## ================================================================================
## Total params: 1,268
## Trainable params: 1,266
## Non-trainable params: 2
## ________________________________________________________________________________
This summary is crucial for a good understanding of the fitted model. It contains the total number of parameters and shows what the exposure is included as an offset (without training the corresponding weight).
For fitting a keras model and its arguments, we refer to the help of fit here. There you find details about the validation_split and the verbose argument.
If validation_split>0, then the training set is further subdivided into a new training and a validation set. The new training set is used for fitting the model and the validation set is used for (out-of-sample) validation. We emphasize that the validation set is chosen disjointly from the test data, as this latter data may still be used later for the choice of the optimal model (if, for instance, we need to decide between several networks).
With validation_split=0.2 we split the learning data 8:2 into training set and validation set. We fit the network on the training set and we out-of-sample validate it on the validation set.
# select number of epochs and batch.size
epochs <- 300
batch.size <- 10000
validation.split <- 0.2 # set to >0 to see train/validation loss in plot(fit)
verbose <- 1
# expected run-time on Renku 8GB environment around 200 seconds
exec_time <- system.time(
fit <- model_dp %>% fit(
list(Xtrain, as.matrix(log(train$Exposure))), as.matrix(train$ClaimNb),
epochs = epochs,
batch_size = batch.size,
validation_split = validation.split,
verbose = verbose
)
)
exec_time[1:5]
## user.self sys.self elapsed user.child sys.child
## 264.322 21.527 72.980 0.000 0.000
Let us illustrate the decrease of loss during the gradient descent algorithm. We provide two charts below - First one: Depending on the argument validation_split you see one curve or two curves (training and validation). - Second one: only shown if validation_split > 0
The plots help to determine the optimal number of epochs.
plot(fit)
plot_loss(x=fit[[2]])
You can see the following on the charts:
Below, we show two examples how the plots should ideally look like:
as follows, closely copy paste what you find there as comments:
In the first chart (citing the comment on the page), the training loss decreases with every epoch. That’s what you would expect when running a gradient-descent optimization – the quantity you’re trying to minimize should be less with every iteration. But that isn’t the case for the validation loss and accuracy: they seem to peak at the fourth epoch. This is an example of what we warned against earlier: a model that performs better on the training data isn’t necessarily a model that will do better on data it has never seen before. In precise terms, what you’re seeing is overfitting: after the second epoch, you’re overoptimizing on the training data, and you end up learning representations that are specific to the training data and don’t generalize to data outside of the training set.
In the second chart, the learning data \(\mathcal{D}\) is split into a training set and a validation set. The training set is used for fitting the model and the validation set is used for (out-of-sample) validation. We emphasize that we choose the validation set disjointly from the test data \(\mathcal{T}\). In the second figure the learning data is split 8:2 into training set (blue color) and validation set (green color). The network is fit on the training set and it is validated on the validation set. We observe that the model starts to over-fit to the learning data after roughly 150 epochs, since the validation loss (green color) starts to increase thereafter.
Comments: * The results for this data is quite clear and hence we can find the optimal number of epochs.
# Validation: Poisson deviance
train$fitdpNN <- as.vector(model_dp %>% predict(list(Xtrain, as.matrix(log(train$Exposure)))))
test$fitdpNN <- as.vector(model_dp %>% predict(list(Xtest, as.matrix(log(test$Exposure)))))
sprintf("100 x Poisson deviance shallow network (train): %s", PoissonDeviance(train$fitdpNN, train$ClaimNb))
## [1] "100 x Poisson deviance shallow network (train): 23.6492673265021"
sprintf("100 x Poisson deviance shallow network (test): %s", PoissonDeviance(test$fitdpNN, test$ClaimNb))
## [1] "100 x Poisson deviance shallow network (test): 24.0045443675194"
# average frequency
sprintf("Average frequency (test): %s", round(sum(test$fitdpNN) / sum(test$Exposure), 4))
## [1] "Average frequency (test): 0.0739"
trainable_params <- sum(unlist(lapply(model_dp$trainable_weights, k_count_params)))
df_cmp %<>% bind_rows(
data.frame(model = "M2: Deep Plain Network", epochs = epochs,
run_time = round(exec_time[[3]], 0), parameters = trainable_params,
in_sample_loss = round(PoissonDeviance(train$fitdpNN, train$ClaimNb), 4),
out_sample_loss = round(PoissonDeviance(test$fitdpNN, test$ClaimNb), 4),
avg_freq = round(sum(test$fitdpNN) / sum(test$Exposure), 4)
))
df_cmp
# Area
out <- test %>% group_by(Area) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitdpNN) / sum(Exposure))
p1 <- plot_freq(out, "Area", "frequency by area", "dpNN")
# VehPower
out <- test %>% group_by(VehPower) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitdpNN) / sum(Exposure))
p2 <- plot_freq(out, "VehPower", "frequency by vehicle power", "dpNN")
# VehBrand
out <- test %>% group_by(VehBrand) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitdpNN) / sum(Exposure))
p3 <- plot_freq(out, "VehBrand", "frequency by vehicle brand", "dpNN")
# VehAge
out <- test %>% group_by(VehAge) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitdpNN) / sum(Exposure))
p4 <- plot_freq(out, "VehAge", "frequency by vehicle age", "dpNN")
grid.arrange(p1, p2, p3, p4)
Is worthwhile to remark that the fit is quite close to the observations and that the neural networks smooths out some jumps in the observations for VehAge which is in line with expectations.
Exercise: Change the number of neurons, compare the number of parameters and the fitted results.
Exercise: Change the input feature space by removing for example Area, VehPower from the input and compare the number of network parameters and the fitted results.
Exercise: Read the documentation to the optimizers and run the subsequent analysis with different optimizers and compare the results.
Exercise: Change the random seeds (at the beginning of the tutorial) and compare the results.
Exercise Run the same analysis and see if you can fully reproduce the results. What do you conclude from that?
Exercise: Change the activiation function (only where it is appropriate) and compare the results.
Exercise: Change the validation.split and verbose argument and see the difference in the fitting of the model.
The network in the above figure (middle) shows for each feature component one single neuron in the input layer (blue, green and magenta colors), thus, an input layer of dimension \(q0 = 9\). However, we have two categorical feature components \(\texttt{VehBrand}\) and \(\texttt{Region}\) with more than 2 different categorical labels. One-hot encoding requires that these two components receive 11 and 22 input neurons, respectively. Thus, one-hot encoding implies that the input layer has dimension \(q0\) (if we assume that all other feature components need one single input neuron). This results in dimension r for the network. This is exactly the network illustrated in the figure (lhs), with one-hot encoding for \(\texttt{VehBrand}\) in green color and one-hot encoding for \(\texttt{Region}\) in magenta color. Brute-force network calibration then simply fits this model using a version of the gradient descent algorithm.
We should ask ourselves whether the brute-force implementation of categorical feature components using one-hot encoding is optimal, since it seems to introduce an excessive number of network parameters. There is a second reason why one-hot encoding seems to be sub-optimal for our purposes. In general, we would like to identify (cluster) labels that are similar for the regression modeling problem. This is not the case with one-hot encoding. If we consider, for instance, the 11 vehicle brands \(\mathcal{B} = \{B_1, B_{10},...,B_6\}\), one-hot encoding assigns a different unit vector to each VehBrand. For two different brands \(\texttt{VehBrand1} \ne \texttt{VehBrand2}\) we always receive a distance of \(\sqrt2\), thus, the (Euclidean) distance between all vehicle brands is the same under one-hot encoding. In tis section we present embedding layers which aim at embedding categorical feature components into low dimensional Euclidean spaces, clustering labels that are more similar for the regression modeling problem.
Recently, it has been proposed to use embedding layers for categorical feature components, and he has noted that this can lead to better results compared to one-hot encoding. Embedding layers are very common in natural language processing (NLP). Embedding layers are used in NLP in order to represent words by numerical coordinates in a low dimensional space. This approach has two advantages. First, the dimension is reduced (compared to one-hot encoding with a large sparse matrix). Second, similarities between words can be examined and provide additional insights to one-hot encoding.
In order not to rewrite everything, below an exact from the tutorial on the functioning of embeddings:
We are now going to compare the brute-force one-hot encoding and embedding layers. We choose a network of depth \(K = 3\) as shown above, having hidden neurons \((q_1, q_2, q_3) = (20, 15, 10)\), and using one-hot encoding for the feature components \(\texttt{VehBrand}\) and \(\texttt{Region}\), this network is illustrated in the figure above.
Below we define the non-embedded network parameters.
# definition of non-embedded features
features <- c("AreaX", "VehPowerX", "VehAgeX", "DrivAgeX", "BonusMalusX", "VehGasX", "DensityX")
q0 <- length(features) # number of non-embedded input features
q1 <- 20 # number of neurons in first hidden layer
q2 <- 15 # number of neurons in second hidden layer
q3 <- 10 # number of neurons in second hidden layer
sprintf("Neural network with K=3 hidden layer")
## [1] "Neural network with K=3 hidden layer"
sprintf("non-embedded feature dimension: q0 = %s", q0)
## [1] "non-embedded feature dimension: q0 = 7"
sprintf("Number of hidden neurons first layer: q1 = %s", q1)
## [1] "Number of hidden neurons first layer: q1 = 20"
sprintf("Number of hidden neurons second layer: q2 = %s", q2)
## [1] "Number of hidden neurons second layer: q2 = 15"
sprintf("Number of hidden neurons third layer: q3 = %s", q3)
## [1] "Number of hidden neurons third layer: q3 = 10"
sprintf("Output dimension: %s", 1)
## [1] "Output dimension: 1"
# training data
Xtrain <- as.matrix(train[, features]) # design matrix training sample
VehBrandtrain <- as.matrix(train$VehBrandX)
Regiontrain <- as.matrix(train$RegionX)
Ytrain <- as.matrix(train$ClaimNb)
# testing data
Xtest <- as.matrix(test[, features]) # design matrix test sample
VehBrandtest <- as.matrix(test$VehBrandX)
Regiontest <- as.matrix(test$RegionX)
Ytest <- as.matrix(test$ClaimNb)
The below lines differentiate the embedded NN from the embedded CANN
# choosing the right volumes for EmbNN
Vtrain <- as.matrix(log(train$Exposure))
Vtest <- as.matrix(log(test$Exposure))
lambda_hom <- sum(train$ClaimNb) / sum(train$Exposure)
sprintf("Empirical frequency (train): %s", round(lambda_hom, 4))
## [1] "Empirical frequency (train): 0.0736"
The above lines differentiate the embedded NN from the embedded CANN
Below we define the non-embedded network parameters.
# set the number of levels for the embedding variables
VehBrandLabel <- length(unique(train$VehBrandX))
RegionLabel <- length(unique(train$RegionX))
sprintf("Embedded VehBrand feature dimension: %s", VehBrandLabel)
## [1] "Embedded VehBrand feature dimension: 11"
sprintf("Embedded Region feature dimension: %s", RegionLabel)
## [1] "Embedded Region feature dimension: 22"
Below we define with \(d\) the number of embedding layers. \(d=2\) corresponds to the network structure above (rhs). We use the same embedding dimension for both categorical feature components.
# dimension embedding layers for categorical features
d <- 2
The code for designing the network architecture with embedding layers for the categorical explanatory variables VehBrand and Region is given below.
# define the network architecture
Design <- layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
VehBrand <- layer_input(shape = c(1), dtype = 'int32', name = 'VehBrand')
Region <- layer_input(shape = c(1), dtype = 'int32', name = 'Region')
LogVol <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVol')
BrandEmb <- VehBrand %>%
layer_embedding(input_dim = VehBrandLabel, output_dim = d, input_length = 1, name = 'BrandEmb') %>%
layer_flatten(name='Brand_flat')
RegionEmb <- Region %>%
layer_embedding(input_dim = RegionLabel, output_dim = d, input_length = 1, name = 'RegionEmb') %>%
layer_flatten(name='Region_flat')
Network <- list(Design, BrandEmb, RegionEmb) %>% layer_concatenate(name = 'concate') %>%
layer_dense(units = q1, activation = 'tanh', name = 'hidden1') %>%
layer_dense(units = q2, activation = 'tanh', name = 'hidden2') %>%
layer_dense(units = q3, activation = 'tanh', name = 'hidden3') %>%
layer_dense(units = 1, activation = 'linear', name = 'Network',
weights = list(array(0, dim = c(q3,1)), array(log(lambda_hom), dim = 1)))
Response <- list(Network, LogVol) %>% layer_add(name = 'Add') %>%
layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
weights = list(array(1, dim = c(1,1)), array(0, dim = 1)))
model_nn <- keras_model(inputs = c(Design, VehBrand, Region, LogVol), outputs = c(Response))
Let us compile the model, using the Poisson deviance loss function as objective function, and nadam as the optimizer, and we provide a summary of the network structure.
For further details, we refer to the help file of compile here.
model_nn %>% compile(
loss = 'poisson',
optimizer = optimizers[7]
)
summary(model_nn)
## Model: "model_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param # Connected to
## ================================================================================
## VehBrand (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## Region (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## BrandEmb (Embedding) (None, 1, 2) 22 VehBrand[0][0]
## ________________________________________________________________________________
## RegionEmb (Embedding) (None, 1, 2) 44 Region[0][0]
## ________________________________________________________________________________
## Design (InputLayer) [(None, 7)] 0
## ________________________________________________________________________________
## Brand_flat (Flatten) (None, 2) 0 BrandEmb[0][0]
## ________________________________________________________________________________
## Region_flat (Flatten) (None, 2) 0 RegionEmb[0][0]
## ________________________________________________________________________________
## concate (Concatenate) (None, 11) 0 Design[0][0]
## Brand_flat[0][0]
## Region_flat[0][0]
## ________________________________________________________________________________
## hidden1 (Dense) (None, 20) 240 concate[0][0]
## ________________________________________________________________________________
## hidden2 (Dense) (None, 15) 315 hidden1[0][0]
## ________________________________________________________________________________
## hidden3 (Dense) (None, 10) 160 hidden2[0][0]
## ________________________________________________________________________________
## Network (Dense) (None, 1) 11 hidden3[0][0]
## ________________________________________________________________________________
## LogVol (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## Add (Add) (None, 1) 0 Network[0][0]
## LogVol[0][0]
## ________________________________________________________________________________
## Response (Dense) (None, 1) 2 Add[0][0]
## ================================================================================
## Total params: 794
## Trainable params: 792
## Non-trainable params: 2
## ________________________________________________________________________________
This summary is crucial for a good understanding of the fitted model. It contains the total number of parameters and shows what the exposure is included as an offset (without training the corresponding weight).
# select number of epochs and batch.size
epochs <- 500
batch_size <- 10000
validation_split <- 0.2 # set to >0 to see train/validation loss in plot(fit)
verbose <- 1
# fitting the neural network
# expected run-time on Renku 8GB environment around 50 seconds
exec_time <- system.time(
fit <- model_nn %>% fit(
list(Xtrain, VehBrandtrain, Regiontrain, Vtrain), Ytrain,
epochs = epochs,
batch_size = batch_size,
verbose = verbose,
validation_split = validation_split
)
)
exec_time[1:5]
## user.self sys.self elapsed user.child sys.child
## 386.443 39.574 124.854 0.000 0.000
plot(fit)
plot_loss(x=fit[[2]])
# calculating the predictions
train$fitNN <- as.vector(model_nn %>% predict(list(Xtrain, VehBrandtrain, Regiontrain, Vtrain)))
test$fitNN <- as.vector(model_nn %>% predict(list(Xtest, VehBrandtest, Regiontest, Vtest)))
# average in-sample and out-of-sample losses (in 10^(-2))
sprintf("100 x Poisson deviance shallow network (train): %s", PoissonDeviance(train$fitNN, as.vector(unlist(train$ClaimNb))))
## [1] "100 x Poisson deviance shallow network (train): 23.685939964396"
sprintf("100 x Poisson deviance shallow network (test): %s", PoissonDeviance(test$fitNN, as.vector(unlist(test$ClaimNb))))
## [1] "100 x Poisson deviance shallow network (test): 23.8738697769126"
# average frequency
sprintf("Average frequency (test): %s", round(sum(test$fitNN) / sum(test$Exposure), 4))
## [1] "Average frequency (test): 0.0765"
# extract the number of trainable parameters from the keras model
tot_params <- count_params(model_nn)
trainable_params <- sum(unlist(lapply(model_nn$trainable_weights, k_count_params)))
nontrainable_params <- sum(unlist(lapply(model_nn$non_trainable_weights, k_count_params)))
df_cmp %<>% bind_rows(
data.frame(model = "M3: EmbNN", epochs = epochs,
run_time = round(exec_time[[3]], 0), parameters = trainable_params,
in_sample_loss = round(PoissonDeviance(train$fitNN, as.vector(unlist(train$ClaimNb))), 4),
out_sample_loss = round(PoissonDeviance(test$fitNN, as.vector(unlist(test$ClaimNb))), 4),
avg_freq = round(sum(test$fitNN) / sum(test$Exposure), 4)
))
df_cmp
Subsequently, we show the observed marginal frequencies and the underlying volumes.
# VehBrand
p1 <- ggplot(dat, aes(VehBrand)) +
geom_bar(aes(weight = Exposure)) +
labs(x = "Vehicle Brand", y = "exposure", title = "exposure by vehicle brand") +
theme(axis.text.x = element_text(size = 5))
out <- dat %>% group_by(VehBrand) %>% summarize(freq = mean(ClaimNb) / mean(Exposure))
p2 <- ggplot(out, aes(x = VehBrand, y = freq, group = 1)) +
geom_point() + geom_line(linetype = "dashed") +
ylim(0, 0.35) + labs(x = "VehBrand", y = "frequency", title = "frequency by vehicle brand") +
theme(axis.text.x = element_text(size = 5))
# Region
p3 <- ggplot(dat, aes(Region)) +
geom_bar(aes(weight = Exposure)) +
labs(x = "Region", y = "exposure", title = "exposure by region") +
theme(axis.text.x = element_text(angle = 90, size = 5))
out <- dat %>% group_by(Region) %>% summarize(freq = mean(ClaimNb) / mean(Exposure))
p4 <- ggplot(out, aes(x = Region, y = freq, group = 1)) +
geom_point() + geom_line(linetype = "dashed") +
ylim(0, 0.35) + labs(x = "Region", y = "frequency", title = "frequency by region") +
theme(axis.text.x = element_text(angle = 90, size = 5))
grid.arrange(p1, p2, p3, p4)
The embedding layers have an other advantage, namely, we can graphically illustrate the findings of the network (at least if \(d\) is small). This is very useful in NLP as it allows us to explore similar words graphically in 2 or 3 dimensions, after some further dimension reduction techniques have been applied.
We illustrate below the resulting embedding weights. We observe clustering in both categorical labels, which indicates that some labels could be merged. For \(\texttt{VehBrand}\) we observe that car brand \(B_{12}\) is diferent from all other car brands, \(B_{10}\) and \(B_{11}\) seem to have similarities, and the remaining car brands cluster. For \(\texttt{Region}\) the result is more diverse.
## VehBrand
emb_Brand <- (model_nn$get_layer("BrandEmb") %>% get_weights())[[1]] %>%
as.data.frame() %>%
add_column(levels(factor(dat2$VehBrand))) %>%
rlang::set_names(c("Dim_1", "Dim_2", "Label"))
ggplot(emb_Brand, aes(x = Dim_1, y = Dim_2, label = Label)) +
geom_point(size = 1) +
geom_text(aes(label = Label), hjust = 0, vjust = 0, size = 3) +
ggtitle("2-dim embedding: VehBrand")
## Region
emb_Region <- (model_nn$get_layer("RegionEmb") %>% get_weights())[[1]] %>%
as.data.frame() %>%
add_column(levels(factor(dat2$Region))) %>%
rlang::set_names(c("Dim_1", "Dim_2", "Label"))
ggplot(emb_Region, aes(x = Dim_1, y = Dim_2, label = Label)) +
geom_point(size = 1) +
geom_text(aes(label = Label), hjust = 0, vjust = 0, size = 3) +
ggtitle("2-dim embedding: Region")
# Area
out <- test %>% group_by(Area) %>% summarize(obs = sum(ClaimNb) / sum(Exposure),
pred = sum(fitNN) / sum(Exposure))
p1 <- plot_freq(out, "Area", "frequency by area", "EmbNN")
# VehPower
out <- test %>% group_by(VehPower) %>% summarize(obs = sum(ClaimNb) / sum(Exposure),
pred = sum(fitNN) / sum(Exposure))
p2 <- plot_freq(out, "VehPower", "frequency by vehicle power", "EmbNN")
# VehBrand
out <- test %>% group_by(VehBrand) %>% summarize(obs = sum(ClaimNb) / sum(Exposure),
pred = sum(fitNN) / sum(Exposure))
p3 <- plot_freq(out, "VehBrand", "frequency by vehicle brand", "EmbNN")
# VehAge
out <- test %>% group_by(VehAge) %>% summarize(obs = sum(ClaimNb) / sum(Exposure),
pred = sum(fitNN) / sum(Exposure))
p4 <- plot_freq(out, "VehAge", "frequency by vehicle age", "EmbNN")
grid.arrange(p1, p2, p3, p4)
Exercise: Change the embedding dimension \(d\) to another figure and compare the results.
Exercise: Change the embedding dimension \(d\) to \(>2\), and apply a dimension reduction technique to still plot the 2-dimensional charts.
Exercise: There are additional variables (which we made numerical above like Area) that could be candidates to be embedded. Choose some of them, embed them and compare the results.
We can draw the following conclusions:
In this section we combine the classical GLM and the network. This approach can be seen as rather universal because it applies to many other parametric regression problems. The idea is to nest the GLM into a network architecture.
The CANN approach of Model Assumptions 3.1 is illustrated below. The skip connection in orange color contains the GLM (note that for the moment we neglect that categorical feature components may use a different encoding for the GLM and the network parts).
We provide some remarks.
The second important ingredient is the following idea.
Note that initialization (3.2) exactly provides the MLE prediction of the GLM part of the CANN model (3.1), i.e. it minimizes the Poisson deviance loss. If we start the gradient descent algorithm for fitting the CANN model (3.1) in this initial value, and if we use the Poisson deviance loss as objective function, then the algorithm explores the network architecture for additional model structure that is not present in the GLM and which lowers the initial Poisson deviance loss related to the (initial) network parameter. In this way we obtain an improvement of the GLM by network features. This provides a more systematic way of using network architectures to improve the GLM. We will highlight this with several examples.
In the case of the Poisson distribution we can substantially simplify the CANN implementation. From (3.3) we see that if the GLM part is non-trainable with MLE, then we can merge this term with the given volumes \(vi\). We observe that this code has become much simpler and shown below.
Below we define the non-embedded network parameters.
# definition of non-embedded features
features <- c("AreaX", "VehPowerX", "VehAgeX", "DrivAgeX", "BonusMalusX", "VehGasX", "DensityX")
q0 <- length(features) # number of non-embedded input features
q1 <- 20 # number of neurons in first hidden layer
q2 <- 15 # number of neurons in second hidden layer
q3 <- 10 # number of neurons in second hidden layer
sprintf("Neural network with K=3 hidden layer")
## [1] "Neural network with K=3 hidden layer"
sprintf("non-embedded feature dimension: q0 = %s", q0)
## [1] "non-embedded feature dimension: q0 = 7"
sprintf("Number of hidden neurons first layer: q1 = %s", q1)
## [1] "Number of hidden neurons first layer: q1 = 20"
sprintf("Number of hidden neurons second layer: q2 = %s", q2)
## [1] "Number of hidden neurons second layer: q2 = 15"
sprintf("Number of hidden neurons third layer: q3 = %s", q3)
## [1] "Number of hidden neurons third layer: q3 = 10"
sprintf("Output dimension: %s", 1)
## [1] "Output dimension: 1"
# training data
Xtrain <- as.matrix(train[, features]) # design matrix training sample
VehBrandtrain <- as.matrix(train$VehBrandX)
Regiontrain <- as.matrix(train$RegionX)
Ytrain <- as.matrix(train$ClaimNb)
# testing data
Xtest <- as.matrix(test[, features]) # design matrix test sample
VehBrandtest <- as.matrix(test$VehBrandX)
Regiontest <- as.matrix(test$RegionX)
Ytest <- as.matrix(test$ClaimNb)
The below lines differentiate the embedded NN from the embedded CANN
# choosing the right volumes for CANN (difference to NN above!)
Vtrain <- as.matrix(log(train$fitGLM2))
Vtest <- as.matrix(log(test$fitGLM2))
lambda_hom <- sum(train$ClaimNb) / sum(train$fitGLM2)
sprintf("Empirical frequency (train): %s", round(lambda_hom, 4))
## [1] "Empirical frequency (train): 1"
The above lines differentiate the embedded NN from the embedded CANN
Below we define the non-embedded network parameters.
# set the number of levels for the embedding variables
VehBrandLabel <- length(unique(train$VehBrandX))
RegionLabel <- length(unique(train$RegionX))
sprintf("Embedded VehBrand feature dimension: %s", VehBrandLabel)
## [1] "Embedded VehBrand feature dimension: 11"
sprintf("Embedded Region feature dimension: %s", RegionLabel)
## [1] "Embedded Region feature dimension: 22"
Below we define with \(d\) the number of embedding layers. \(d=2\) corresponds to the network structure above (rhs). We use the same embedding dimension for both categorical feature components.
# dimensions embedding layers for categorical features
d <- 2
The code for designing the network architecture with embedding layers for the categorical explanatory variables VehBrand and Region is given below.
# define the network architecture
Design <- layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
VehBrand <- layer_input(shape = c(1), dtype = 'int32', name = 'VehBrand')
Region <- layer_input(shape = c(1), dtype = 'int32', name = 'Region')
LogVol <- layer_input(shape = c(1), dtype = 'float32', name = 'LogVol')
BrandEmb <- VehBrand %>%
layer_embedding(input_dim = VehBrandLabel, output_dim = d, input_length = 1, name = 'BrandEmb') %>%
layer_flatten(name = 'Brand_flat')
RegionEmb <- Region %>%
layer_embedding(input_dim = RegionLabel, output_dim = d, input_length = 1, name = 'RegionEmb') %>%
layer_flatten(name = 'Region_flat')
Network <- list(Design, BrandEmb, RegionEmb) %>% layer_concatenate(name = 'concate') %>%
layer_dense(units = q1, activation = 'tanh', name = 'hidden1') %>%
layer_dense(units = q2, activation = 'tanh', name = 'hidden2') %>%
layer_dense(units = q3, activation = 'tanh', name = 'hidden3') %>%
layer_dense(units = 1, activation = 'linear', name = 'Network',
weights = list(array(0, dim = c(q3, 1)), array(log(lambda_hom), dim = 1)))
Response <- list(Network, LogVol) %>% layer_add(name = 'Add') %>%
layer_dense(units = 1, activation = k_exp, name = 'Response', trainable = FALSE,
weights = list(array(1, dim = c(1, 1)), array(0, dim = 1)))
model_cann <- keras_model(inputs = c(Design, VehBrand, Region, LogVol), outputs = c(Response))
model_cann %>% compile(
loss = 'poisson',
optimizer = optimizers[7]
)
summary(model_cann)
## Model: "model_2"
## ________________________________________________________________________________
## Layer (type) Output Shape Param # Connected to
## ================================================================================
## VehBrand (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## Region (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## BrandEmb (Embedding) (None, 1, 2) 22 VehBrand[0][0]
## ________________________________________________________________________________
## RegionEmb (Embedding) (None, 1, 2) 44 Region[0][0]
## ________________________________________________________________________________
## Design (InputLayer) [(None, 7)] 0
## ________________________________________________________________________________
## Brand_flat (Flatten) (None, 2) 0 BrandEmb[0][0]
## ________________________________________________________________________________
## Region_flat (Flatten) (None, 2) 0 RegionEmb[0][0]
## ________________________________________________________________________________
## concate (Concatenate) (None, 11) 0 Design[0][0]
## Brand_flat[0][0]
## Region_flat[0][0]
## ________________________________________________________________________________
## hidden1 (Dense) (None, 20) 240 concate[0][0]
## ________________________________________________________________________________
## hidden2 (Dense) (None, 15) 315 hidden1[0][0]
## ________________________________________________________________________________
## hidden3 (Dense) (None, 10) 160 hidden2[0][0]
## ________________________________________________________________________________
## Network (Dense) (None, 1) 11 hidden3[0][0]
## ________________________________________________________________________________
## LogVol (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## Add (Add) (None, 1) 0 Network[0][0]
## LogVol[0][0]
## ________________________________________________________________________________
## Response (Dense) (None, 1) 2 Add[0][0]
## ================================================================================
## Total params: 794
## Trainable params: 792
## Non-trainable params: 2
## ________________________________________________________________________________
This summary is crucial for a good understanding of the fitted model. It contains the total number of parameters and shows what the exposure is included as an offset (without training the corresponding weight).
# select number of epochs and batch.size
epochs <- 500
batch_size <- 10000
validation_split <- 0.2 # set to >0 to see train/validation loss in plot(fit)
verbose <- 1
# fitting the neural network
# expected run-time on Renku 8GB environment around 50 seconds
exec_time <- system.time(
fit <- model_cann %>% fit(list(Xtrain, VehBrandtrain, Regiontrain, Vtrain), Ytrain,
epochs = epochs,
batch_size = batch_size,
verbose = verbose,
validation_split = validation_split)
)
exec_time[1:5]
## user.self sys.self elapsed user.child sys.child
## 384.041 38.849 124.335 0.000 0.000
plot(fit)
plot_loss(x=fit[[2]])
# calculating the predictions
train$fitCANN <- as.vector(model_cann %>% predict(list(Xtrain, VehBrandtrain, Regiontrain, Vtrain)))
test$fitCANN <- as.vector(model_cann %>% predict(list(Xtest, VehBrandtest, Regiontest, Vtest)))
# average in-sample and out-of-sample losses (in 10^(-2))
sprintf("100 x Poisson deviance shallow network (train): %s", PoissonDeviance(train$fitCANN, as.vector(unlist(train$ClaimNb))))
## [1] "100 x Poisson deviance shallow network (train): 23.6721779400734"
sprintf("100 x Poisson deviance shallow network (test): %s", PoissonDeviance(test$fitCANN, as.vector(unlist(test$ClaimNb))))
## [1] "100 x Poisson deviance shallow network (test): 23.9723402856499"
# average frequency
sprintf("Average frequency (test): %s", round(sum(test$fitCANN) / sum(test$Exposure), 4))
## [1] "Average frequency (test): 0.0773"
trainable_params <- sum(unlist(lapply(model_cann$trainable_weights, k_count_params)))
df_cmp %<>% bind_rows(
data.frame(model = "M4: EmbCANN", epochs = epochs,
run_time = round(exec_time[[3]], 0), parameters = trainable_params,
in_sample_loss = round(PoissonDeviance(train$fitCANN, as.vector(unlist(train$ClaimNb))), 4),
out_sample_loss = round(PoissonDeviance(test$fitCANN, as.vector(unlist(test$ClaimNb))), 4),
avg_freq = round(sum(test$fitCANN)/sum(test$Exposure), 4)
))
df_cmp
The embedding layers have an other advantage, namely, we can graphically illustrate the findings of the network (at least if \(d\) is small). This is very useful in NLP as it allows us to explore similar words graphically in 2 or 3 dimensions, after some further dimension reduction techniques have been applied.
We illustrate below the resulting embedding weights. We observe clustering in both categorical labels, which indicates that some labels could be merged. For \(\texttt{VehBrand}\) we observe that car brand \(B_{12}\) is diferent from all other car brands, \(B_{10}\) and \(B_{11}\) seem to have similarities, and the remaining car brands cluster. For \(\texttt{Region}\) the result is more diverse.
## VehBrand
emb_Brand <- (model_cann$get_layer("BrandEmb") %>% get_weights())[[1]] %>%
as.data.frame() %>%
add_column(levels(factor(dat2$VehBrand))) %>%
rlang::set_names(c("Dim_1", "Dim_2", "Label"))
ggplot(emb_Brand, aes(x = Dim_1, y = Dim_2, label = Label)) +
geom_point(size = 1) +
geom_text(aes(label = Label), hjust = 0, vjust = 0, size = 3) +
ggtitle("2-dim embedding: VehBrand")
## Region
emb_Region <- (model_cann$get_layer("RegionEmb") %>% get_weights())[[1]] %>%
as.data.frame() %>%
add_column(levels(factor(dat2$Region))) %>%
rlang::set_names(c("Dim_1", "Dim_2", "Label"))
ggplot(emb_Region, aes(x = Dim_1, y = Dim_2, label = Label)) +
geom_point(size = 1) +
geom_text(aes(label = Label), hjust = 0, vjust = 0, size = 3) +
ggtitle("2-dim embedding: Region")
# Area
out <- test %>% group_by(Area) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitCANN) / sum(Exposure))
p1 <- plot_freq(out, "Area", "Frequency by area", "CANN")
# VehPower
out <- test %>% group_by(VehPower) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitCANN) / sum(Exposure))
p2 <- plot_freq(out, "VehPower", "Frequency by vehicle power", "CANN")
# VehBrand
out <- test %>% group_by(VehBrand) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitCANN) / sum(Exposure))
p3 <- plot_freq(out, "VehBrand", "Frequency by vehicle brand", "CANN")
# VehAge
out <- test %>% group_by(VehAge) %>%
summarize(obs = sum(ClaimNb) / sum(Exposure), pred = sum(fitCANN) / sum(Exposure))
p4 <- plot_freq(out, "VehAge", "Frequency by vehicle age", "CANN")
grid.arrange(p1, p2, p3, p4)
We would like to emphasize that the CANN aproach is by no means restricted to the GLM. In fact, we can choose any regression model for the skip connection, for instance, we can replace the GLM prediction by a generalized additive model (GAM) prediction in the working weight de nition. This is exactly the idea behind the Poisson boosting machine. Please find further details in the tutorial on SSRN.
This model has not been discussed in the theoretical part and goes beyond the scope of the course, so do not hesitate to skip it. Details about this model can be found in the tutorial https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3320525, p. 23ff.
The marginal modeling used in Model GLM2 can be (slightly) improved, but it does not explain the big differences between Model GLM2 and the neural network models. Therefore, the major weakness of Model GLM2 compared to the neural network models must come from missing interactions in the former model. Note that in this former model all interactions are of multiplicative type between the feature components. This deficiency is going to be explored next.
For all our subsequent derivations we enhance Model GLM2 by improving the marginal modeling of the feature components VehAge and BonusMalus using a joint GAM adjustment. That is, we consider the regression function
where the first term on the right-hand side (scalar product) is the part originating from Model GLM2, and \(ns^2_1\) and \(ns^2_2\) are two natural cubic splines enhancing Model GLM2 by GAM features. We fit these two natural cubic splines simultaneously using the GAM framework and we call this improvement Model GAM1. The corresponding code is given below. We compress the data w.r.t. the two selected feature components VehAge and BonusMalus, and we fit the natural cubic splines for these two variables using the logged working weights \(log(v^{GLM}_i)\) as offsets.
exec_time <- system.time({
datGAM <- train %>% group_by(VehAge, BonusMalus) %>% summarize(fitGLM2 = sum(fitGLM2), ClaimNb = sum(ClaimNb))
gam1 <- gam(ClaimNb ~ s(VehAge, bs="cr") + s(BonusMalus, bs="cr"), data = datGAM, method = "GCV.Cp", offset = log(fitGLM2), family = poisson)
})
exec_time[1:5]
## user.self sys.self elapsed user.child sys.child
## 0.275 0.000 0.275 0.000 0.000
summary(gam1)
##
## Family: poisson
## Link function: log
##
## Formula:
## ClaimNb ~ s(VehAge, bs = "cr") + s(BonusMalus, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.01309 0.01296 1.01 0.313
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(VehAge) 8.495 8.917 79.63 <2e-16 ***
## s(BonusMalus) 8.716 8.966 858.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.985 Deviance explained = 21.9%
## UBRE = 1.2258 Scale est. = 1 n = 1455
# Predictions
train$fitGAM1 <- exp(predict(gam1, newdata = train)) * train$fitGLM2
test$fitGAM1 <- exp(predict(gam1, newdata = test)) * test$fitGLM2
dat$fitGAM1 <- exp(predict(gam1, newdata = dat)) * dat$fitGLM2
# in-sample and out-of-sample losses (in 10^(-2))
sprintf("100 x Poisson deviance GLM (train): %s", PoissonDeviance(train$fitGAM1, train$ClaimNb))
## [1] "100 x Poisson deviance GLM (train): 23.9215405928445"
sprintf("100 x Poisson deviance GLM (test): %s", PoissonDeviance(test$fitGAM1, test$ClaimNb))
## [1] "100 x Poisson deviance GLM (test): 24.0234983219793"
# Overall estimated frequency
sprintf("average frequency (test): %s", round(sum(test$fitGAM1) / sum(test$Exposure), 4))
## [1] "average frequency (test): 0.0737"
df_cmp %<>% bind_rows(
data.frame(model = "M5: GAM1", epochs = NA,
run_time = round(exec_time[[3]], 0), parameters = length(coef(glm2)) + round(sum(pen.edf(gam1)), 1),
in_sample_loss = round(PoissonDeviance(train$fitGAM1, as.vector(unlist(train$ClaimNb))), 4),
out_sample_loss = round(PoissonDeviance(test$fitGAM1, as.vector(unlist(test$ClaimNb))), 4),
avg_freq = round(sum(test$fitGAM1) / sum(test$Exposure), 4)
))
df_cmp
We see the expected improvement in out-of-sample loss from GLM2 to GAM1 . However, there is still a big gap compared to the neural network approaches. Note that Model GAM1 is based on multiplicative interactions that we are going to challenge next.
This model has not been discussed in the theoretical part and goes beyond the scope of the course, so do not hesitate to skip it. Details about this model can be found in the tutorial https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3320525, p. 23ff.
We may explore missing interactions in the models considered above. As base model we choose Model GAM1, as follows:
In (3.11) we define two parallel neural networks that only interact in the last step where we concatenate them by adding up the terms. The reason for the choice is that we did not observe major interactions between the components of the two parallel networks (see tutorial).
# neural network definitions for model (3.11)
trainX <- list(as.matrix(train[, c("VehPowerX", "VehAgeX", "VehGasX")]),
as.matrix(train[, "VehBrandX"]),
as.matrix(train[, c("DrivAgeX", "BonusMalus")]),
as.matrix(log(train$fitGAM1)) )
testX <- list(as.matrix(test[, c("VehPowerX", "VehAgeX", "VehGasX")]),
as.matrix(test[, "VehBrandX"]),
as.matrix(test[, c("DrivAgeX", "BonusMalus")]),
as.matrix(log(test$fitGAM1)) )
neurons <- c(20, 15, 10)
no_labels <- length(unique(train$VehBrandX))
# definition of neural network (3.11)
Model2IA <- function(no_labels) {
Cont1 <- layer_input(shape = c(3), dtype = 'float32', name = 'Cont1')
Cat1 <- layer_input(shape = c(1), dtype = 'int32', name = 'Cat1')
Cont2 <- layer_input(shape = c(2), dtype = 'float32', name = 'Cont2')
LogExposure <- layer_input(shape = c(1), dtype = 'float32', name = 'LogExposure')
x_input <- c(Cont1, Cat1, Cont2, LogExposure)
Cat1_embed <- Cat1 %>%
layer_embedding(input_dim = no_labels, output_dim = 2, trainable = TRUE, input_length = 1, name = 'Cat1_embed') %>%
layer_flatten(name = 'Cat1_flat')
NNetwork1 <- list(Cont1, Cat1_embed) %>% layer_concatenate(name = 'cont') %>%
layer_dense(units = neurons[1], activation = 'tanh', name = 'hidden1') %>%
layer_dense(units = neurons[2], activation = 'tanh', name = 'hidden2') %>%
layer_dense(units = neurons[3], activation = 'tanh', name = 'hidden3') %>%
layer_dense(units = 1, activation = 'linear', name = 'NNetwork1',
weights = list(array(0, dim = c(neurons[3], 1)), array(0, dim = 1)))
NNetwork2 <- Cont2 %>%
layer_dense(units = neurons[1], activation = 'tanh', name = 'hidden4') %>%
layer_dense(units = neurons[2], activation = 'tanh', name = 'hidden5') %>%
layer_dense(units = neurons[3], activation = 'tanh', name = 'hidden6') %>%
layer_dense(units = 1, activation = 'linear', name = 'NNetwork2',
weights = list(array(0, dim = c(neurons[3], 1)), array(0, dim = 1)))
NNoutput <- list(NNetwork1, NNetwork2, LogExposure) %>% layer_add(name = 'Add') %>%
layer_dense(units = 1, activation = k_exp, name = 'NNoutput', trainable = FALSE,
weights = list(array(c(1), dim = c(1, 1)), array(0, dim = 1)))
model <- keras_model(inputs = x_input, outputs = c(NNoutput))
model %>% compile(optimizer = optimizer_nadam(),
loss = 'poisson')
model
}
model_gamplus <- Model2IA(no_labels)
summary(model_gamplus)
## Model: "model_3"
## ________________________________________________________________________________
## Layer (type) Output Shape Param # Connected to
## ================================================================================
## Cat1 (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## Cat1_embed (Embedding) (None, 1, 2) 22 Cat1[0][0]
## ________________________________________________________________________________
## Cont1 (InputLayer) [(None, 3)] 0
## ________________________________________________________________________________
## Cat1_flat (Flatten) (None, 2) 0 Cat1_embed[0][0]
## ________________________________________________________________________________
## cont (Concatenate) (None, 5) 0 Cont1[0][0]
## Cat1_flat[0][0]
## ________________________________________________________________________________
## Cont2 (InputLayer) [(None, 2)] 0
## ________________________________________________________________________________
## hidden1 (Dense) (None, 20) 120 cont[0][0]
## ________________________________________________________________________________
## hidden4 (Dense) (None, 20) 60 Cont2[0][0]
## ________________________________________________________________________________
## hidden2 (Dense) (None, 15) 315 hidden1[0][0]
## ________________________________________________________________________________
## hidden5 (Dense) (None, 15) 315 hidden4[0][0]
## ________________________________________________________________________________
## hidden3 (Dense) (None, 10) 160 hidden2[0][0]
## ________________________________________________________________________________
## hidden6 (Dense) (None, 10) 160 hidden5[0][0]
## ________________________________________________________________________________
## NNetwork1 (Dense) (None, 1) 11 hidden3[0][0]
## ________________________________________________________________________________
## NNetwork2 (Dense) (None, 1) 11 hidden6[0][0]
## ________________________________________________________________________________
## LogExposure (InputLayer) [(None, 1)] 0
## ________________________________________________________________________________
## Add (Add) (None, 1) 0 NNetwork1[0][0]
## NNetwork2[0][0]
## LogExposure[0][0]
## ________________________________________________________________________________
## NNoutput (Dense) (None, 1) 2 Add[0][0]
## ================================================================================
## Total params: 1,176
## Trainable params: 1,174
## Non-trainable params: 2
## ________________________________________________________________________________
# select number of epochs and batch_size
epochs <- 100
batch_size <- 10000
verbose <- 1
# expected run-time on Renku 8GB environment around 65 seconds
# may take a couple of minutes if epochs is more than 100
exec_time <- system.time(
fit <- model_gamplus %>% fit(trainX, as.matrix(train$ClaimNb),
epochs = epochs,
batch_size = batch_size,
verbose = verbose,
validation_data = list(testX, as.matrix(test$ClaimNb)))
)
exec_time[1:5]
## user.self sys.self elapsed user.child sys.child
## 133.691 13.070 34.636 0.000 0.000
plot(fit)
plot_loss(x=fit[[2]])
As both validation and training are decreasing, the number of epochs needs to be increased to find the optimal number of epochs.
# calculating the predictions
train$fitGAMPlus <- as.vector(model_gamplus %>% predict(trainX))
test$fitGAMPlus <- as.vector(model_gamplus %>% predict(testX))
# average in-sample and out-of-sample losses (in 10^(-2))
sprintf("100 x Poisson deviance shallow network (train): %s", PoissonDeviance(train$fitGAMPlus, as.vector(unlist(train$ClaimNb))))
## [1] "100 x Poisson deviance shallow network (train): 23.8862938433159"
sprintf("100 x Poisson deviance shallow network (test): %s", PoissonDeviance(test$fitGAMPlus, as.vector(unlist(test$ClaimNb))))
## [1] "100 x Poisson deviance shallow network (test): 24.0248198368468"
# average frequency
sprintf("Average frequency (test): %s", round(sum(test$fitGAMPlus) / sum(test$Exposure), 4))
## [1] "Average frequency (test): 0.0742"
trainable_params <- sum(unlist(lapply(model_gamplus$trainable_weights, k_count_params)))
df_cmp %<>% bind_rows(
data.frame(model = "M6: GAM+", epochs = epochs,
run_time = round(exec_time[[3]], 0), parameters = trainable_params,
in_sample_loss = round(PoissonDeviance(train$fitGAMPlus, as.vector(unlist(train$ClaimNb))), 4),
out_sample_loss = round(PoissonDeviance(test$fitGAMPlus, as.vector(unlist(test$ClaimNb))), 4),
avg_freq = round(sum(test$fitGAMPlus) / sum(test$Exposure), 4)
))
df_cmp
We observe excellent fitting results of Model GAM+ compared to the other neural network models. This illustrates that in (3.11) we capture the main interaction terms.
plot_freq_multi <- function(xvar, title) {
out <- test %>% group_by(!!sym(xvar)) %>% summarize(
obs = sum(ClaimNb) / sum(Exposure),
glm = sum(fitGLM2) / sum(Exposure),
gam = sum(fitGAM1) / sum(Exposure),
gamp = sum(fitGAMPlus) / sum(Exposure)
)
ggplot(out, aes(x = !!sym(xvar), group = 1)) +
geom_point(aes(y = obs, colour = "observed")) + geom_line(aes(y = obs, colour = "observed"), linetype = "dashed") +
geom_point(aes(y = glm, colour = "GLM")) + geom_line(aes(y = glm, colour = "GLM"), linetype = "dashed") +
geom_point(aes(y = gam, colour = "GAM1")) + geom_line(aes(y = gam, colour = "GAM1"), linetype = "dashed") +
geom_point(aes(y = gamp, colour = "GAM+")) + geom_line(aes(y = gamp, colour = "GAM+"), linetype = "dashed") +
ylim(0, 0.35) + labs(x = xvar, y = "frequency", title = title) + theme(legend.position = "bottom")
}
# Area
p1 <- plot_freq_multi("Area", "frequency by area")
# VehPower
p2 <- plot_freq_multi("VehPower", "frequency by VehPower")
# VehBrand
p3 <- plot_freq_multi("VehBrand", "frequency by VehBrand")
# VehAge
p4 <- plot_freq_multi("VehAge", "frequency by VehAge")
grid.arrange(p1, p2, p3, p4)
So far, we have fitted three different network architecture and we are comparing their performance and quality of fit.
As mentioned above, one can amend and change hyperparameters (optimizer, batch size, epochs,...) and see how the results change.
One can also change the neural network architecture, e.g.
See:
Exercise: Have a look at the keras cheat sheet, study the normalization layer and add them and compare the results.
Exercise: Look at other layers, try to understand them and compare the impact of these layers.
Exercise: Apply ridge regularization, understand what it is and compare the results.
With fitting neural networks, there are a couple of questions arising, and we like to share some of our experience here:
Comparison of various metrics for the models.
df_cmp
We can draw the following conclusions:
plot_cmp <- function(xvar, title, maxlim = 0.35) {
out <- test %>% group_by(!!sym(xvar)) %>% summarize(
vol = sum(Exposure),
obs = sum(ClaimNb) / sum(Exposure),
glm = sum(fitGLM2) / sum(Exposure),
nn = sum(fitNN) / sum(Exposure),
cann = sum(fitCANN) / sum(Exposure),
gam = sum(fitGAM1) / sum(Exposure),
gamplus = sum(fitGAMPlus) / sum(Exposure)
)
max_pri <- max(out$obs, out$glm, out$nn, out$cann, out$gam, out$gamplus)
max_sec <- max(out$vol)
max_ratio <- max_pri / max_sec
if (is.null(maxlim)) maxlim <- max_pri
ggplot(out, aes(x = !!sym(xvar), group = 1)) +
geom_point(aes(y = obs, colour = "observed")) + geom_line(aes(y = obs, colour = "observed"), linetype = "dashed") +
geom_point(aes(y = glm, colour = "GLM")) + geom_line(aes(y = glm, colour = "GLM"), linetype = "dashed") +
geom_point(aes(y = nn, colour = "EmbNN")) + geom_line(aes(y = nn, colour = "EmbNN"), linetype = "dashed") +
geom_point(aes(y = cann, colour = "CANN")) + geom_line(aes(y = cann, colour = "CANN"), linetype = "dashed") +
geom_point(aes(y = gam, colour = "GAM")) + geom_line(aes(y = gam, colour = "GAM"), linetype = "dashed") +
geom_point(aes(y = gamplus, colour = "GAM+")) + geom_line(aes(y = gamplus, colour = "GAM+"), linetype = "dashed") +
geom_bar(aes(y = vol * (max_ratio)), colour = "grey", stat = "identity", alpha = 0.3) +
scale_y_continuous(name = "Frequency", sec.axis = sec_axis( ~ . / (max_ratio), name = "Exposure"), limits = c(0, maxlim)) +
labs(x = xvar, title = title) + theme(legend.position = "bottom", legend.text = element_text(size = 6))
}
# Area
p1 <- plot_cmp("Area", "Frequency and Exposure by Area")
# VehPower
p2 <- plot_cmp("VehPower", "Frequency and Exposure by VehPower")
# VehBrand
p3 <- plot_cmp("VehBrand", "Frequency and Exposure by VehBrand")
# VehAge
p4 <- plot_cmp("VehAge", "Frequency and Exposure by VehAge")
# DrivAge plot with exposure distribution
p5 <- plot_cmp("DrivAge", "Frequency and Exposure by DrivAge", maxlim = NULL)
grid.arrange(p1, p2, p3, p4, p5)
Some further exercises:
Exercise: Fit a CANN without embeddings (based on Model 2) and compare the results.
Exercise: Increase the number of epochs (requiring longer computation time) to find the optimal number of epochs for every model.
Out-of-sample claims frequency predictions (on log-scales) comparison.
axis_min <- log(max(test$fitCANN, test$fitNN))
axis_max <- log(min(test$fitCANN, test$fitNN))
ggplot(test, aes(x = log(fitNN), y = log(fitCANN), colour = Exposure)) + geom_point() +
geom_abline(colour = "#000000", slope = 1, intercept = 0) +
xlim(axis_max, axis_min) + ylim(axis_max, axis_min) +
labs(x = " EmbNN", y = "EmbCANN", title = "Claims frequency prediction (log-scale)") +
scale_colour_gradient(low = "green", high = "red")
axis_min <- log(max(test$fitCANN, test$fitGLM2))
axis_max <- log(min(test$fitCANN, test$fitGLM2))
ggplot(test, aes(x = log(fitGLM2), y = log(fitCANN), colour = Exposure)) + geom_point() +
geom_abline(colour = "#000000", slope = 1, intercept = 0) +
xlim(axis_max, axis_min) + ylim(axis_max, axis_min) +
labs(x = "GLM", y = "EmbCANN", title = "Claims frequency prediction (log-scale)") +
scale_colour_gradient(low = "green", high = "red")
The html is generated with the follow packages (which migth be slightly newer than the ones used in the published tutorial).
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_1.1.3 splitTools_0.3.1 gridExtra_2.3 ggplot2_3.3.3
## [5] purrr_0.3.4 tibble_3.1.0 dplyr_1.0.5 magrittr_2.0.1
## [9] keras_2.4.0 mgcv_1.8-33 nlme_3.1-152
##
## loaded via a namespace (and not attached):
## [1] reticulate_1.18 tidyselect_1.1.0 xfun_0.22 bslib_0.2.4
## [5] splines_3.6.3 lattice_0.20-41 colorspace_2.0-0 vctrs_0.3.6
## [9] generics_0.1.0 htmltools_0.5.1.1 yaml_2.2.1 base64enc_0.1-3
## [13] utf8_1.1.4 rlang_0.4.10 jquerylib_0.1.3 pillar_1.6.0
## [17] glue_1.4.2 withr_2.4.1 DBI_1.1.1 rappdirs_0.3.3
## [21] lifecycle_1.0.0 tensorflow_2.4.0 stringr_1.4.0 munsell_0.5.0
## [25] gtable_0.3.0 evaluate_0.14 labeling_0.4.2 knitr_1.31
## [29] tfruns_1.5.0 fansi_0.4.2 highr_0.8 Rcpp_1.0.6
## [33] scales_1.1.1 jsonlite_1.7.2 farver_2.1.0 digest_0.6.27
## [37] stringi_1.5.3 grid_3.6.3 tools_3.6.3 sass_0.3.1
## [41] crayon_1.4.1 whisker_0.4 pkgconfig_2.0.3 zeallot_0.1.0
## [45] ellipsis_0.3.1 Matrix_1.3-2 assertthat_0.2.1 rmarkdown_2.7
## [49] R6_2.5.0 compiler_3.6.3
reticulate::py_config()
## python: /usr/bin/python3
## libpython: /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6.so
## pythonhome: //usr://usr
## version: 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
## numpy: /usr/local/lib/python3.6/dist-packages/numpy
## numpy_version: 1.19.5
## tensorflow: /usr/local/lib/python3.6/dist-packages/tensorflow
##
## python versions found:
## /usr/bin/python3
## /usr/bin/python
tensorflow::tf_version()
## [1] '2.4'